Distributed metric data time rollup in real-time

ABSTRACT

In one aspect, a system for distributed consistent hash backed time rollup of performance metric data is disclosed. The system includes a plurality of collectors configured to receive time series metrics data for a plurality of performance metrics from one or more agents instrumented into monitored applications; a plurality of aggregators communicatively connected to the collectors and configured to aggregate the received time series metric data for the plurality of performance metrics, wherein each aggregator is assigned to aggregate all received time series metrics data for one or more of the plurality of performance metrics; and a coordinator communicatively connected to the plurality of collectors and plurality of aggregators and configured to provide collectors with information on availability of the plurality of aggregators.

BACKGROUND

In pursuit of the highest level of service performance and user experience, companies around the world are engaging in digital transformation by enhancing investments in digital technology and information technology (IT) services. By leveraging the global system of interconnected computer networks afforded by the Internet and the World Wide Web, companies are able to provide ever increasing web services to their clients. The web services may be provided by a web application which uses multiple services and applications to handle a given transaction. The applications may be distributed over several interconnected machines, such as servers, making the topology of the machines that provide the service more difficult to track and monitor.

SUMMARY

Examples of implementations for distributed metric data time rollup in real-time are disclosed. Specifically, the disclosed technology can enable a distributed consistent hash backed metric time rollup mechanism in real time using a read time resolution technique, with built-in support for partial service failures and with high availability.

In one aspect, a system for distributed consistent hash backed time rollup of performance metric data is disclosed. The system includes a plurality of collectors configured to receive time series metrics data for a plurality of performance metrics from one or more agents instrumented into monitored applications; a plurality of aggregators communicatively connected to the collectors and configured to aggregate the received time series metric data for the plurality of performance metrics, wherein each aggregator is assigned to aggregate all received time series metrics data for one or more of the plurality of performance metrics; and a quorum based coordinator communicatively connected to the plurality of collectors and plurality of aggregators and configured to provide collectors with information on the plurality of aggregators including assignments of the performance metrics to each aggregator.

The system can be implemented in various ways to include one or more of the following features. For example, the aggregators can be arranged in a consistent hash ring using a hash function. The coordinator can detect whether one of the plurality of aggregators has been removed from the hash ring or whether a new aggregator has been added to the hash ring. Each aggregator can perform time roll up of the received time series metric data for the assigned one or more performance metrics. Each aggregator can receive the time series metrics data for the assigned one or more of the plurality of performance metrics from two or more of the plurality of collectors. The collectors can be configured to apply the same hash function used to form the hash ring on the received time series metric data for a given performance metric to generate a hash code that points to a location on the hash ring, so that one of the aggregators closest to the hash code on the hash ring is assigned to process all of the received time series metric data for the given performance metric. The collectors can be configured to use the hash code to route the received time series metric data for each performance metric to the corresponding aggregator node on the consistent hash ring. The coordinator can include a quorum based coordinator service that provides the plurality of collectors with information on all of the aggregators that are available to form the consistent hash ring. For example, when there are 10 collectors and 5 aggregators in the system, all 10 collectors will get the same list of aggregators from the quorum based coordinator. The collectors will form the same consistent hash ring by using the same hash function on the aggregators (host and port info). When the time series metric data for a given metric arrives at a collector, the collector applies the same hash function and routes the time series metric data to the aggregator identified based on the hash function. Because the hash function is the same for all 10 collectors, when the time series metric data for the given metric arrives at different collectors for each minute in a 10 minute window, all of the time series metric data for the given metric will be routed to a single aggregator as long as the aggregator is available for processing.

The collectors can be totally stateless, allowing for arbitrarily adding or removing collectors to or from the metric processing system to collect metrics from agents, depending on the load on the system. Thus, the metric processing system is scalable. Arranging the aggregators in a consistent hash ring can allow the collectors to route every metric to the appropriate aggregator for all reported times. This arrangement can allow aggregators to roll up or aggregate metrics in various configured rollup intervals.

In another aspect, a method for distributed consistent hash backed time rollup of performance metric data is disclosed. The method includes receiving, at a plurality of collectors, time series metrics data for a plurality of performance metrics from one or more agents instrumented into monitored applications; aggregating, at a plurality of aggregators communicatively connected to the collectors to form a hash ring, the received time series metrics data for the plurality of performance metrics, wherein each aggregator is assigned to aggregate all received time series metrics data for one or more of the plurality of performance metrics; determining, at a coordinator communicatively connected to the plurality of collectors and the plurality of aggregators, whether the hash ring has changed; and communicating, at the coordinator, information on the determined change to the plurality of collectors.

The method can be implemented in various ways to include one or more of the following features. For example, the information can indicate that one of the plurality of aggregators has been removed or a new aggregator has been added. The method can include redistributing, by the collectors, the received time series metrics data based on the information. The redistributing can include forwarding the received time series metrics data for the one or more of the plurality of performance metrics assigned to the removed aggregator to the next aggregator in the hash ring starting from the next data point (in the time series) after removing the aggregator. The method can include accumulating the time series metrics data for the one or more of the plurality of performance metrics assigned to the removed aggregator received at the removed aggregator before removing the aggregator to obtain an accumulated value for the removed aggregator; and accumulating the time series metrics data for the one or more of the plurality of performance metrics assigned to the removed aggregator received at the next aggregator after removing the aggregator to obtain an accumulated value for the next aggregator. The method can include writing, to a database, the accumulated value for the removed aggregator obtained from the time series metrics data received at the removed aggregator before removing the aggregator; and writing, to the database, the accumulated value for the next aggregator obtained from the time series metrics data received at the next aggregator after removing the aggregator. The method can include merging the two accumulated values to perform a time roll up for a time period.

The redistributing can include forwarding the received time series metrics data for the one or more of the plurality of performance metrics assigned to one of the plurality of aggregators to the newly added aggregator in the consistent hash ring starting from the next data point after adding the new aggregator. The method can include accumulating the time series metrics data for the one or more of the plurality of performance metrics assigned to one of the aggregators received at the one of the aggregators before adding the new aggregator to obtain an accumulated value for the one of the aggregators; and accumulating the time series metrics data for the one or more of the plurality of performance metrics assigned to the one of the aggregators received at the newly added aggregator after adding the aggregator to obtain an accumulated value for the new aggregator. The method can include writing, to a database, the accumulated value for the one of the aggregators obtained from the time series metrics data received at the one of the aggregators before adding the aggregator; and writing, to the database, the accumulated value for the newly added aggregator obtained from the time series metrics data received at the newly added aggregator after adding the newly added aggregator. The method can include merging the two accumulated values to perform a time roll up for a time period.
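The merging of two accumulated values described above can be sketched as follows. This is a minimal illustration only: the `Partial` type, its field names, and the choice of storing a running sum and count (so that an average can be recovered after the merge) are assumptions made for this example and are not specified in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partial:
    """Accumulated value written by one aggregator (hypothetical layout)."""
    total: float  # sum of the data points this aggregator received
    count: int    # number of data points this aggregator received

    def merge(self, other: "Partial") -> "Partial":
        # Sums and counts combine exactly, regardless of how the
        # window was split between the two aggregators.
        return Partial(self.total + other.total, self.count + other.count)

    def average(self) -> float:
        return self.total / self.count


def accumulate(values) -> Partial:
    """Accumulate the data points one aggregator saw within a rollup window."""
    values = list(values)
    return Partial(sum(values), len(values))


# The original aggregator handled minutes 1-4 of a 10-minute window;
# the aggregator that took over handled minutes 5-10.
before_change = accumulate([10.0, 12.0, 11.0, 13.0])
after_change = accumulate([9.0, 10.0, 14.0, 12.0, 11.0, 8.0])

# Merging the two stored values yields the rollup for the whole window.
rolled_up = before_change.merge(after_change)
```

Keeping a sum and a count, rather than a pre-computed average, matters in this sketch because the two aggregators generally see different numbers of data points, so averaging their two averages would weight the data points unevenly.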

In yet another aspect, a non-transitory computer readable medium embodying instructions is disclosed. When the instructions are executed by a processor, the instructions can cause operations to be performed including: receiving, at a plurality of collectors, time series metrics data for a plurality of performance metrics from one or more agents instrumented into monitored applications; aggregating, at a plurality of aggregators communicatively connected to the collectors to form a consistent hash ring, the received time series metrics data for the plurality of performance metrics, wherein each aggregator is assigned to aggregate all received time series metrics data for one or more of the plurality of performance metrics; determining, at a coordinator communicatively connected to the plurality of collectors and the plurality of aggregators, whether one of the plurality of the aggregators in the hash ring has crashed; and performing a repair job to fix data corruption caused by the crashed aggregator.

The non-transitory computer readable medium can be implemented to include one or more of the following features. For example, the operations can include redistributing the received time series metrics data for the one or more of the plurality of performance metrics assigned to the crashed aggregator to the next aggregator in the hash ring. Performing the repair job can include obtaining the time series metrics data received at the crashed aggregator before the crash; and merging the obtained time series metric data from the crashed aggregator with the time series metrics data redistributed to the next aggregator in the hash ring. The operations can include splitting the time series metrics data received at the crashed aggregator before the crash into smaller time series.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A and 1B are process flow diagrams showing an exemplary technique for distributed metric data time rollup in real-time.

FIGS. 2A and 2B are block diagrams illustrating time rollup of distributed metric data from granular raw data to progressively less granular data.

FIGS. 3A and 3B are block diagrams of tables illustrating exemplary metric key time buckets for rolling up raw data to progressively less granular data.

FIG. 3C is an exemplary metric key.

FIG. 4 is a block diagram of an exemplary system for performing distributed metric data time rollup in real-time.

FIG. 5A is a process flow diagram of a process for scaling the nodes of aggregators by adding a new aggregator to the ring.

FIG. 5B is a block diagram illustrating an addition of a new aggregator described in process 500 of FIG. 5A.

FIG. 5C is a process flow diagram of the process for fixing or preventing data corruption.

FIG. 5D shows an exemplary column with two cells from two aggregators.

FIG. 6A is a process flow diagram of a process for scaling the nodes of aggregators by removing an aggregator from the ring.

FIG. 6B is a block diagram illustrating the removal of the aggregator described in process 600 of FIG. 6A.

FIG. 6C is a process flow diagram of the process 608 for fixing or preventing data corruption due to the removed aggregator.

FIG. 6D shows an exemplary column 630 with the two cells.

FIG. 7A is a process flow diagram of a process for repairing metric data after an aggregator crashes.

FIG. 7B is a block diagram illustrating a crashed aggregator in a hash ring of aggregator nodes.

FIG. 7C is a process flow diagram of a process for performing the repair job.

FIG. 7D is a process flow diagram of a process for preparing for the repair job so that the repair job when performed is optimized.

FIG. 7E is a block diagram illustrating buckets created during the repair job preparation process.

FIG. 7F is a process flow diagram showing an exemplary process for performing a repair job after a crash once the preparation for the repair job has been integrated into the process.

FIG. 8 is a block diagram of an exemplary application intelligence platform that can implement the distributed metric data time rollup in real-time using the disclosed technology, including the processes disclosed with respect to FIGS. 1A through 1D.

FIG. 9 is a block diagram of an exemplary implementation of the application intelligence platform for distributed metric data time rollup in real-time using the disclosed technology.

FIG. 10 is a block diagram of an exemplary computing system implementing the disclosed technology.

DETAILED DESCRIPTION

Application intelligence platforms disclosed in this patent document enable application performance monitoring and management using metrics driven processes and systems. In the disclosed application intelligence platforms, monitoring of the raw application performance data is performed by highly efficient instrumentation agents that have automatic code injection capability to trace virtually every line of code in a given application. These agents automatically instrument millions of lines of code across thousands of tiers in production environments. The agents support a wide variety of languages and frameworks including Java, .NET, Node.js, PHP, Ruby, etc. In addition, the disclosed application intelligence platforms include browser and mobile agents to collect end user monitoring information from browsers or any mobile devices.

The agents collect metrics data indicative of a number of application performance measures and send the collected metrics data periodically to a metric processing engine at a collector. The metric processing engine rolls up or aggregates the metrics in different dimensions and creates multiple views. The created views are extensively used for reporting and rule engine evaluation for health rule policies.

The created views can include at least two types, a hierarchical cluster rollup view and a time rollup view. In this patent document, the disclosed technology can enable the time rollup views. Time rollups are performed by rolling up and aggregating raw granular (e.g., 1-minute interval) time series data into progressively less granular data. For example, the raw 1-minute time series data can be rolled up into 10-minute buckets (e.g., by averaging the 1-minute interval data every 10 minutes). The rolled up 10-minute buckets can be rolled up into hourly buckets (e.g., by averaging every six 10-minute buckets into each hourly bucket). The rolled up hourly buckets can be rolled up into daily buckets (e.g., by averaging every twenty-four 1-hour buckets into each daily bucket). The rolled up daily buckets can be rolled up into weekly buckets (e.g., by averaging every seven daily buckets into each weekly bucket). The rolled up weekly buckets can be rolled into monthly buckets (e.g., by averaging every 4 weekly buckets into each monthly bucket). The rolled up monthly buckets can be rolled into yearly buckets (e.g., by rolling up every 12 monthly buckets into each yearly bucket). In some implementations, other granularities in the metrics rollup can be used.
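The progressive rollup described above can be illustrated with a short sketch. The function name `roll_up` and the synthetic input data are hypothetical and chosen only for illustration; averaging is used as the aggregate because it is the example given in the text.

```python
def roll_up(points, factor):
    """Average each consecutive group of `factor` points into one bucket."""
    if len(points) % factor:
        raise ValueError("number of points must be a multiple of the factor")
    return [
        sum(points[i:i + factor]) / factor
        for i in range(0, len(points), factor)
    ]


# Two hours of synthetic 1-minute data points (values 0..9 repeating).
raw_1min = [float(minute % 10) for minute in range(120)]

ten_min = roll_up(raw_1min, 10)  # 120 points -> 12 ten-minute buckets
hourly = roll_up(ten_min, 6)     # 12 buckets -> 2 hourly buckets
```

Each level consumes the output of the previous level, so an hourly bucket is computed from six 10-minute buckets rather than from the sixty raw data points.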

The volume of performance metrics monitored and collected by the agents is extremely high, resulting in storage resource burdens. For example, the total number of performance metrics collected per minute by an agent from a large application can be in the range of 10 to 20 million data points. Because each performance metric has several performance statistics, the system can ingest 2 to 4 TB of data every day. Storing the high-resolution (e.g., per minute) data for a long time is impracticable and cost prohibitive. One solution to address the storage burden is to expire the highest resolution data after some period. In order to provide access to the performance metrics for a much longer period of time, the high resolution performance metrics are rolled up into lesser resolutions (e.g., hourly, daily, weekly, monthly, yearly). Also, a health rule policy engine can be used to evaluate data based on baseline and standard deviation statistics for a longer period of time. Table 1 below shows an exemplary data retention policy for different resolutions of performance metrics.

TABLE 1
Exemplary data retention policy for different resolutions of performance data

Metric Resolution    Retention Period
1 minute             1 day
10 minutes           8 days
60 minutes           1 year

For a data collection rate of one data point per minute, each agent can collect 10 performance metrics data points over a period of 10 minutes, and 60 performance metrics data points over a period of 60 minutes. Using the data retention policy of Table 1, the system will continuously roll up the 1-minute resolution metrics to 10-minute resolution every 10 minutes, and then to 60-minute resolution every hour. The rollup process is performed for every metric collected by the agents from the application monitored by the agents. End users can access the collected metrics data by using queries that identify a metric ID and a time range. Depending on the time range requested, a different rollup of the collected data may need to be returned. For example, when the time range specified is within 24 hours, the 1-minute resolution data (i.e., the raw collected metrics data) can be returned. When the time range specified is within 8 days, the 10-minute resolution data (i.e., 1-minute resolution data rolled up every 10 minutes) can be returned. When the time range specified is beyond 8 days, the 60-minute resolution data (e.g., the 10-minute resolution data rolled up every 60 minutes) can be returned.
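The selection of a resolution from the requested time range can be sketched as below. The function name `resolution_for` and the reading of "time range" as how far back the query reaches are assumptions for this illustration; the thresholds come from Table 1.

```python
from datetime import timedelta

# Thresholds taken from the retention periods in Table 1.
ONE_MINUTE_LIMIT = timedelta(days=1)
TEN_MINUTE_LIMIT = timedelta(days=8)


def resolution_for(lookback: timedelta) -> str:
    """Pick the finest resolution still retained for the requested range.

    `lookback` is assumed to be how far back in time the query reaches.
    """
    if lookback <= ONE_MINUTE_LIMIT:
        return "1-minute"
    if lookback <= TEN_MINUTE_LIMIT:
        return "10-minute"
    return "60-minute"
```

For instance, under these assumptions a query covering the last 6 hours would be served from the 1-minute data, while a query covering the last 30 days would be served from the 60-minute data.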

The disclosed technology can provide for a scalable metrics processing system that can dynamically scale out based on the load (number of metrics received per minute). The disclosed metrics processing system avoids any single point of failure by allowing a node failure to be handled gracefully. The disclosed metrics processing system does not cause any corruption of rolled up metric data. The disclosed metrics processing system is highly available.

The technology disclosed in this patent document provides for dynamic and efficient application intelligence platforms, systems, devices, methods, and computer readable media including non-transitory type that embody instructions for causing a machine including a processor to perform various operations disclosed in this patent document to obtain the desired application intelligence data. Specifically, the disclosed technology provides for a distributed consistent hash backed metric time rollup mechanism in real time using a read time resolution technique, with built-in support for partial service failures and with high availability.

Distributed Consistent Hash Backed Metric Time Rollup Techniques

FIGS. 1A and 1B are process flow diagrams of an exemplary technique for performing distributed consistent hash backed metric time rollup in real time using read time resolution. The techniques disclosed in FIGS. 1A and 1B are performed by a metrics processing system as disclosed with respect to FIG. 4. As shown in FIG. 1A, the technique 100 for distributed consistent hash backed metric time rollup includes receiving, at one or more collectors, monitored metrics data from one or more agents instrumented into monitored applications at process 102. The monitored metrics data are received in original high resolution (e.g., 1-minute resolution) by the collectors at the backend of the system, for example. The received monitored metrics data in original high resolution (e.g., 1-minute resolution) are at least temporarily stored in a distributed database, such as HBase, at process 104. The stored metrics data in original high resolution are sent (e.g., by collectors) to aggregators organized into a consistent hash ring at process 106. The aggregators perform appropriate rollup of the metrics data in original high resolution into different resolutions at process 108. Examples of rolling up metrics data into different resolutions are disclosed with respect to FIGS. 2A and 2B.

All of the aggregators are registered with the coordinator, such as ZooKeeper, to identify the aggregators that are available to form a consistent hash ring. As shown in FIG. 1B, the process 106 for sending the stored metrics data in original high resolution to aggregators organized into a consistent hash ring includes obtaining information on the aggregators that are available to form a hash ring at process 110. The hash ring is formed by applying a hash function to the available aggregators at process 111. The collectors apply the same hash function that is used to create the consistent hash ring on the received time series of metric data at process 112 to return a hash code that will point to a location on the consistent hash ring. The aggregator closest to the hash code on the consistent hash ring will process the metric for all reported times. The collectors use this technique to route the incoming time series metric data to the correct aggregator node on the consistent hash ring at process 113. Thus, all time series data for a particular metric are sent to the same aggregator until an aggregator fails. However, each aggregator may process multiple metrics.
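The routing in processes 110 through 113 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the disclosed implementation: the class name `HashRing`, the use of MD5 as the hash function, the "host:port" identifiers, and the interpretation of "closest" as the first aggregator at or after the hash code (wrapping around the ring) are all choices made for this example. The aggregator list is assumed to be non-empty.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a ring position (MD5 is an illustrative choice)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Ring built by hashing each aggregator's "host:port" identifier."""

    def __init__(self, aggregators):
        # Sorting by position makes the ring independent of list order,
        # so every collector given the same list builds the same ring.
        self._points = sorted((ring_position(a), a) for a in aggregators)
        self._positions = [position for position, _ in self._points]

    def route(self, metric_id: str) -> str:
        """Return the first aggregator at or after the metric's hash code."""
        index = bisect.bisect_left(self._positions, ring_position(metric_id))
        return self._points[index % len(self._points)][1]


# Hypothetical aggregator list, as a collector might obtain it
# from the coordinator.
aggregators = ["agg1:9000", "agg2:9000", "agg3:9000", "agg4:9000"]

# Ten independent collectors build their own rings and agree on the target.
collectors = [HashRing(aggregators) for _ in range(10)]
targets = {collector.route("m1") for collector in collectors}
```

Because the ring depends only on the aggregator list and the hash function, `targets` contains a single aggregator. In this sketch, removing one aggregator from the list changes the routing only for the metrics that aggregator owned, which is the property that lets the remaining aggregators continue their rollups undisturbed.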

FIGS. 2A and 2B are block diagrams illustrating examples 200 and 210 of rolling up collected metrics data into different resolutions. Time rollups are performed by rolling up and aggregating raw granular (e.g., 1-minute interval) time series data into progressively less granular data. As shown in FIG. 2A, the raw collected data 202 in high resolution (e.g., t-minute resolution) are rolled up to obtain Rollup 1 data 204 by averaging the collected data 202 every X ‘t-time’ intervals. Thus, the Rollup 1 data has a resolution that is reduced to 1/X of that of the collected data 202. The Rollup 1 data are rolled up to obtain Rollup 2 data by averaging the Rollup 1 data every Y ‘Rollup 1’ intervals. Thus, the Rollup 2 data has a resolution that is reduced to 1/Y of that of the Rollup 1 data 204 and to 1/(XY) of that of the collected data 202. The Rollup N data are obtained by rolling up Rollup N-1 data by averaging the Rollup N-1 data every Z ‘Rollup N-1’ intervals. Thus, the Rollup N data has a resolution that is reduced to 1/Z of that of the Rollup N-1 data (not shown), or to 1/(X·Y· . . . ·Z) of that of the collected data 202, where the product runs over the factors of all rollup levels. In this manner, each Rollup data of a given resolution is rolled up to the next lower resolution.

FIG. 2B shows an example 210 of the rollup process illustrated in FIG. 2A. As shown in FIG. 2B, the raw 1-minute resolution time series Raw data 212 can be rolled up into 10-minute resolution data buckets (e.g., by averaging the 1-minute interval data every 10 minutes) 214. The rolled up 10-minute resolution buckets can be rolled up into hourly resolution data buckets (e.g., by averaging every six 10-minute buckets into each hourly bucket) 216. The rolled up hourly buckets can be rolled up into daily resolution buckets (e.g., by averaging every twenty-four 1-hour buckets into each daily bucket) (not shown). The rolled up daily buckets can be rolled up into weekly resolution data buckets (e.g., by averaging every seven daily buckets into each weekly bucket) (not shown). The rolled up weekly buckets can be rolled into monthly resolution data buckets (e.g., by averaging every 4 weekly buckets into each monthly bucket) (not shown). The rolled up monthly buckets can be rolled into yearly resolution buckets (e.g., by rolling up every 12 monthly buckets into each yearly bucket) 218. In some implementations, other granularities in the metrics rollup can be used.

Distributed Time Rollup Design

The disclosed technology can provide for a metrics processing system that uses a micro services architecture that ingests metrics data in real time and performs time rollups in the data stream. The results of the rollup are stored in a distributed database, such as HBase. Storing the results of rollups in HBase provides advantages. For example, HBase enables a sharding strategy to create ordered partitioning of its key ranges. Using the sharding strategy, long time range queries (e.g., over weeks, months, years, etc.) can be implemented and aggregates can be efficiently applied on the results of the long time range queries at a single shard level. All keys are lexicographically sorted and stored by range in shards called regions. The keys can be designed to store several years of metric data in a single shard or region in HBase.

Exemplary HBase Table design

FIGS. 3A and 3B are diagrams showing exemplary table designs 300 and 310 for HBase. For example, the HBase table 300 can have multiple column families (e.g., 1, 2, . . . , and N), one column family for each data resolution. Each column family has a different TTL configuration matching a corresponding retention period for a different metric resolution. The example table 310 in FIG. 3B has 3 column families 1, 2, and 3. Each column family has a different TTL configuration, namely (Column Family 1: TTL 24 hours), (Column Family 2: TTL 8 days), and (Column Family 3: TTL 1 year), matching the retention period for a different metric resolution. For example, in table 310, the 1-minute resolution data can be written into the first column family having a retention period of 1 day. After rolling up the metrics for 10 minutes and 60 minutes, the respective rollups can be stored in the 2nd and 3rd column families. HBase can use the retention periods of the column families to automatically delete metrics.

Exemplary HBase Metric Key Design

The metrics data received from the agents is time series data where every metric data point received has a time stamp associated with it. A metric key can be created to be time bucketed, and values can be stored against columns for each minute. To avoid hot spotting, the metric ID and the source are placed before the time in the key. In addition, a prefix salt based on the initial size of the HBase cluster can be used and the table can be pre-split at the time of setup. For example, when the metric key time bucket is set up to be 12 hours as shown in FIG. 3B, 720 (i.e., 60 minutes × 12 hours) metric data points are received in 12 hours. These metric values can be stored as different column values depending on the resolution assigned for each column. For example, the column names can be set as the metric payload time. In FIG. 3B, the minute resolution level metric values are stored in the first column family. An exemplary metric key can be designed as shown in FIG. 3C to include: [prefix salt][metric identity][12 hours time since epoch].

The 1-minute resolution metrics data from the first column family are rolled up to 10-minute resolution metric data and stored in the 2nd column family every 10 minutes. Similarly, the 60-minute rolled up metric data points are written to the 3rd column family every hour.
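A key of the form [prefix salt][metric identity][12 hours time since epoch] could be encoded as in the following sketch. The byte widths, the use of CRC32 to derive the salt from the metric identity, the value of `SALT_BUCKETS`, and the helper names `metric_key` and `column_for` are all assumptions made for illustration; the document does not specify an encoding.

```python
import struct
import zlib

BUCKET_SECONDS = 12 * 60 * 60  # 12-hour time bucket per row
SALT_BUCKETS = 16              # assumed number of pre-split regions


def metric_key(metric_id: int, epoch_seconds: int) -> bytes:
    """Build [prefix salt][metric identity][12-hour bucket] as bytes."""
    # Deriving the salt from the metric identity keeps every row of a
    # metric under the same salt while spreading metrics across regions.
    salt = zlib.crc32(struct.pack(">Q", metric_id)) % SALT_BUCKETS
    bucket = epoch_seconds // BUCKET_SECONDS
    # Big-endian fields sort lexicographically in numeric order.
    return struct.pack(">BQI", salt, metric_id, bucket)


def column_for(epoch_seconds: int) -> int:
    """Minute offset (0..719) of a data point within its 12-hour row."""
    return (epoch_seconds % BUCKET_SECONDS) // 60
```

With this layout, all 720 one-minute data points of a 12-hour window share one row key and differ only in their column, and successive 12-hour rows of the same metric are adjacent in key order.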

FIG. 4 is a block diagram illustrating a metrics data processing system 400 for rolling up metrics in real-time as disclosed in this patent document. The metrics data processing system 400 can be implemented in an application platform 401. Examples of the application platform are disclosed further with respect to FIGS. 8 and 9 below. The metrics processing system 400 is a backend system that receives metrics data in time series (m1 (t1-tn), m2 (t1-tn), m3 (t1-tn), . . . , mA (t1-tn)) from ‘A’ total agents 1, 2, 3, . . . , A (402, 404, 406, . . . , 408). The metrics processing system 400 can use Jetty based micro services to roll up the received metrics in real time. The Jetty based micro services in the metric processing system 400 can be organized into two groups: ‘C’ total collectors C1 (410), C2 (412), . . . , C5 (414), . . . , CC (416) and ‘B’ total aggregators a1 (418), a2 (420), a3 (422), . . . , and aB (424). The numbers represented by n, N, A, C, and B can vary based on the deployment of the application intelligence platform 401. Also, n, N, A, C, and B may or may not be the same number.

Collectors 410, 412, 414, and 416 apply the same hash function that is used to create the consistent hash ring on the received metric identities. The hash of the metric identity will return a hash code that will point to a location on the consistent hash ring. The aggregator closest to the hash code on the consistent hash ring will process the metric for all reported times. The collectors use this technique to route the incoming metrics to the correct aggregator node on the consistent hash ring.

In some implementations, the collector services 410, 412, 414, and 416 are placed behind an optional load balancer 409 that receives and distributes metric payload data coming from the instrumentation agents 402, 404, 406, and 408. The load balancer 409 can distribute the metric payload data from the agents 402, 404, 406, and 408 across the collectors 410, 412, 414, and 416 based on the load of each collector. When the load balancer 409 is not used at the backend, the agents 402, 404, 406, and 408 can include the load balancing process or, alternatively, each agent can be assigned to one or more specific collectors. The collectors 410, 412, 414, and 416 process the incoming metric payload data and persist the received metric payload data into the HBase table in the 1st column family. The metric payload data persisted in the 1st column family are 1-minute resolution metrics data. The collectors 410, 412, 414, and 416 send the 1-minute resolution metric data to the aggregator services a1 (418), a2 (420), a3 (422), and aB (424).

The aggregator services 418, 420, 422, and 424 perform time rollup of the 1-minute resolution metrics data and persist the 10-minute and 60-minute rollups to the 2nd and 3rd column families, respectively, in the HBase table (e.g., HBase Tables 300 and 310) in a database 430. The aggregator services 418, 420, 422, and 424 are organized into a consistent hash ring. Each of the aggregator services 418, 420, 422, and 424 can process a certain range in the hash ring or metric range. Processing a certain range in the hash ring or metric range ensures that any metric received at any collector at any time will always be forwarded to the same aggregator service for performing the time rollup. For example, a single aggregator service receives all of the ten 1-minute metric data points for a metric, applies aggregate functions on the metric values, and saves the aggregated value into HBase in the appropriate column family.

As shown in FIG. 4, each of the collectors 410, 412, 414, and 416 can receive metric data for multiple metrics. However, each of the collectors 410, 412, 414, and 416 may not receive all of the metric data points for a particular metric. This is due to the distribution of the metric payload through the load balancer 409, for example, to provide distributed processing of the large amount of data received from the agents 402, 404, 406, and 408. However, the collectors 410, 412, 414, and 416 can synchronize to forward metric data points for a given metric to the same one of the aggregators 418, 420, 422, and 424.

In the example system 400 shown in FIG. 4, the collectors [C1, C2, . . ., CC] are placed behind a load balancer 409 and aggregators [a1, a2, a3, . . ., aB] are organized into a consistent hash ring. One instrumentation agent (Agent 1 (402)) is sending metric m1 every minute from time t1 till tn, another agent (Agent 2 (404)) is sending metric m2 every minute from time t1 till tn, a third agent (Agent 3 (406)) is sending metric m3 every minute from time t1 till tn, and so on until agent A (Agent A (408)) is sending metric mA every minute from time t1 till tn.

Payload p1 containing metrics m1, m2, and m3 sent at 10:00 am arrived at collector C5 (414). Payload p2 containing metrics m1, m2, and m3 sent at 10:01 am arrived at collector C1 (410). Payload p3 containing metrics m1, m2, and m3 sent at 10:02 am arrived at collector C2. Thus, in this example, for each minute, the payload containing metrics m1, m2, and m3 landed at different collector nodes as shown above. Each collector applies a hash function based on the number of aggregators and routes the metric m1 to aggregator a1 (418), routes the metric m2 to aggregator a3 (422), and routes the metric m3 to aggregator aB (424). After 10 minutes, each of the aggregators a1 (418), a3 (422), and aB (424) can aggregate the 10 metric values for m1, m2, and m3 and write the aggregated values for m1, m2, and m3 to HBase. Similarly, after 60 minutes, aggregators a1 (418), a3 (422), and aB (424) can aggregate the 60 metric values for m1, m2, and m3 and write the aggregated values for m1, m2, and m3 to HBase.

Aggregators 418, 420, 422, and 424, rather than the collectors, are used to aggregate the metric data for each metric because collectors are stateless services that are placed behind a load balancer that distributes the load using a round robin rule. Each 1-minute payload for a metric sent from an instrumentation agent may end up being processed at different collectors based on the load balancing performed by the load balancer 409. Time rollups for a metric cannot be performed at the collector services because a single collector process may not have all 10 of the 1-minute level data points, for example. Routing the metric payloads from the collectors to the aggregators based on the hash code for every minute enables the metric processing system 400 to perform time rollup at the aggregator services 418, 420, 422, and 424.
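The routing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the SHA-256 based hash, the use of virtual nodes (`vnodes`), and the names `HashRing` and `route` are all hypothetical choices made for the example.

```python
import bisect
import hashlib
from typing import List, Tuple


def _hash(key: str) -> int:
    """Stable 64-bit hash so that every collector computes the same ring position."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hash ring over aggregator names (illustrative only)."""

    def __init__(self, aggregators: List[str], vnodes: int = 64) -> None:
        if not aggregators:
            raise ValueError("at least one aggregator is required")
        # Each aggregator owns several points on the ring to even out the ranges.
        self._ring: List[Tuple[int, str]] = sorted(
            (_hash(f"{name}#{i}"), name) for name in aggregators for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def route(self, metric: str) -> str:
        """Return the aggregator owning the first ring point at or after the metric's hash."""
        index = bisect.bisect_left(self._points, _hash(metric)) % len(self._ring)
        return self._ring[index][1]


# Two collectors that hold the same aggregator list (in any order) agree on routing.
collector_c1 = HashRing(["a1", "a2", "a3", "a4"])
collector_c2 = HashRing(["a4", "a3", "a2", "a1"])
```

Because the ring depends only on the set of aggregator names, any collector that receives a payload for a metric forwards it to the same aggregator, regardless of which collector the load balancer selected.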

Coordinating Collectors and Aggregators

The collectors 410, 412, 414, and 416 and aggregators 418, 420, 422, and 424 register with the coordinator 426, such as a ZooKeeper based quorum service, at start up. The coordinator 426 monitors the service availability for all of the micro services offered by the collectors 410, 412, 414, and 416 and aggregators 418, 420, 422, and 424. The collectors 410, 412, 414, and 416 retrieve the list of aggregators 418, 420, 422, and 424 from the coordinator 426 at start up. Each collector attaches a watcher to receive a notification when an aggregator joins or leaves the quorum of the coordinator 426. At any given point in time, all collector service nodes 410, 412, 414, and 416 will retrieve the same set of aggregator nodes 418, 420, 422, and 424 from coordinator 426. The collector nodes 410, 412, 414, and 416 can create a consistent hash ring using the retrieved aggregator nodes.

The coordination mechanism works well when the aggregator nodes 418, 420, 422, and 424 remain constant, which allows all metric data points for a metric to be sent to the same aggregator. When the set of aggregator nodes 418, 420, 422, and 424 changes because aggregators are added to or removed from the hash ring, the metric processing system 400 is scalable to dynamically add or remove aggregators gracefully, without an interruption.

The makeup of the aggregator nodes can change for various reasons. For example, when the load increases, new aggregator nodes can be added to the ring. When the load decreases, one or more of the nodes can be removed from the ring.

In some implementations, during software upgrades, older aggregators can be removed from the ring and newer aggregators can be added to the ring continuously.

In some implementations, one or more aggregators might crash and new aggregators can be added to the ring to compensate for crashed nodes.

FIG. 5A is a process flow diagram of a process 500 for scaling the nodes of aggregators by adding a new aggregator to the ring. FIG. 5B is a block diagram illustrating the addition of the new aggregator described in process 500 of FIG. 5A.

In FIG. 5B, 4 aggregator nodes a1 (522), a2 (524), a3 (526), and a4 (528) form a consistent hash ring 520, where aggregator a2 (524) was added to the ring using the process 500 of FIG. 5A. Based on a condition, such as an increase in the load, a determination is made to add a new aggregator at process 502. A new aggregator node a2 (524) is added to the ring at process 504. In the example shown in FIG. 5B, the new aggregator node a2 (524) is added to the ring at 10:15 am. The added aggregator a2 (524) is indicated using a double ring in FIG. 5B. Adding the new aggregator a2 (524) requires redistributing the metric payload data to include the new aggregator a2 (524) at process 506. For example, as shown in FIG. 5B, any metric payload data for m1 that previously arrived at a3 (526) will now go to aggregator a2 (524) after 10:15 am. If, for example, aggregator a3 (526) has received 15 data points for metric m1 before aggregator a2 (524) was added to the ring, aggregator a3 (526) can aggregate 15 minutes of data for the hour rollup and the remaining 45 minutes of metric data points will be rolled up by aggregator a2 (524). If both aggregators a2 (524) and a3 (526) are allowed to write the hour rolled up data into the 3rd column family in HBase with the same column name “10:00”, one aggregator will overwrite the other, corrupting the hour rollup value for 10:00. This is because any change in the aggregator consistent hash ring splits the metric payload data for a rollup window across multiple aggregators, which would otherwise corrupt the time rolled up data. The process 500 includes fixing or preventing data corruption at process 508.

FIG. 5C is a process flow diagram of the process 508 for fixing or preventing data corruption. The corrupted data fixing or prevention process 508 includes adding an aggregator suffix to the column name in HBase at process 510 before writing to the next rollup column family (e.g., for hourly rollup) at process 512. Using the aggregator suffix creates two cells (one from each aggregator) for the hour time rollup with different column names. For example, one cell in the column can be named “10:00_a3” to indicate m1 data written by aggregator a3 (526) with 15 metric data points aggregated. The other cell in the column can be named “10:00_a2” to indicate metric data written by aggregator a2 (524) with 45 data points aggregated. FIG. 5D shows an exemplary column 530 with the two cells. The values in the two cells are merged at process 514. For example, a read scan will pick up the values from both cells and merge the values in real time before returning the results to the client. Using the read time resolution method, two or more aggregated cells can be merged during scans.
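The aggregator suffix and read time resolution described above can be sketched in a few lines of Python. A plain dictionary stands in for the hourly column family of one metric row; the `(count, total)` cell layout and the names `write_rollup` and `read_rollup` are assumptions for the example and do not represent an HBase API.

```python
from typing import Dict, Tuple

# Hypothetical cell value: (count, total), so that partial rollups can be merged.
Cell = Tuple[int, float]


def write_rollup(row: Dict[str, Cell], window: str, aggregator: str, cell: Cell) -> None:
    """Write a rolled-up value under a column name suffixed with the aggregator name."""
    row[f"{window}_{aggregator}"] = cell


def read_rollup(row: Dict[str, Cell], window: str) -> Cell:
    """Read time resolution: merge every cell written for the window, whatever its suffix."""
    count, total = 0, 0.0
    for column, (cell_count, cell_total) in row.items():
        if column.startswith(f"{window}_"):
            count += cell_count
            total += cell_total
    return count, total


hour_family: Dict[str, Cell] = {}  # stands in for the hourly column family of one row
write_rollup(hour_family, "10:00", "a3", (15, 150.0))  # 15 points before a2 joined
write_rollup(hour_family, "10:00", "a2", (45, 900.0))  # 45 points after a2 joined
merged = read_rollup(hour_family, "10:00")
```

Because each aggregator writes to its own column, neither write can overwrite the other, and the merge cost is paid only when the window is read.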

FIG. 6A is a process flow diagram of a process 600 for scaling the nodes of aggregators by removing an aggregator from the ring. FIG. 6B is a block diagram illustrating the removal of the aggregator described in process 600 of FIG. 6A.

In FIG. 6B, 4 aggregator nodes a1 (622), a2 (624), a3 (626), and a4 (628) form a consistent hash ring 620, but aggregator a2 (624) is being removed from the ring using the process 600 of FIG. 6A. Based on a condition, such as a decrease in the load, a determination is made to remove an aggregator at process 602. The desired aggregator is removed at process 604. Due to the removed aggregator, the metric payload data is redistributed at process 606. Metric values for the next minute after removal of aggregator a2 (624) will route to the next aggregator a3 (626) in the ring. The next aggregator a3 (626) will aggregate the remaining minutes of the hour, and the aggregated values from both aggregators will be written to HBase. Because the removal of the aggregator corrupts the rolled up data by creating two aggregated values, the process 600 includes fixing the corrupted data at process 608.

FIG. 6C is a process flow diagram of the process 608 for fixing or preventing data corruption due to the removed aggregator. After the aggregator node is removed from the consistent hash ring, the accumulated metric values from the removed aggregator and from the next aggregator that took over the accumulation are written into the database (e.g., HBase) at process 612 before terminating the process for the removed aggregator at process 614. To prevent or fix data corruption for the two sets of accumulated data, the column names in the database are given the aggregator name as a suffix at process 610 before the accumulated metric values are written to the database at process 612. The aggregator name suffix usage is similar to the approach for adding a new aggregator node to the ring as disclosed with respect to FIG. 5C above.

Using the aggregator suffix creates two cells (one from each aggregator) for the hour time rollup with different column names. FIG. 6D shows an exemplary column 630 with the two cells. The values in the two cells are merged at process 616. For example, a read scan will pick up the values from both cells and merge the values in real time before returning the results to the client. Using the read time resolution method, two or more aggregated cells can be merged during scans.

The disclosed technology also provides means to repair time rolled up data corruption when an aggregator node crashes. In the event of an aggregator node crash, the crashed aggregator node would not be able to write the accumulated metric values into the database (e.g., HBase). FIG. 7A is a process flow diagram of a process 700 for repairing data after a crash. FIG. 7B is a block diagram of an exemplary hash ring of aggregators 720 that illustrates an aggregator crash. For example, in the hash ring 720 of FIG. 7B, when aggregator a2 crashes at time 10:25 am, all the metric values for m1 that were being sent to a2 will be redistributed to aggregator a3 from that minute onward. After the crash, aggregator a3 will receive and accumulate metric values from 10:25 am to 10:59 am, a total of 35 data points. Aggregator a3 will write the aggregated value for these 35 data points into HBase. The initial 25 minutes of data for m1 received by aggregator a2 will be lost with the crashed aggregator a2.

FIG. 7C is a process flow diagram of the process 704 for performing the repair job. FIG. 7D is a process flow diagram of a process 712 for preparing for the repair job so that the repair job, when performed, is optimized. FIG. 7E is a block diagram illustrating buckets 730 created during the repair job preparation process. FIG. 7F is a process flow diagram showing an exemplary process 740 for performing a repair job after a crash once the preparation for the repair job has been integrated into the process.

For the process 700 shown in FIG. 7A, because the aggregators and collectors register themselves with the coordinator quorum service 426 and availability is monitored using watchers, the coordinator 426 can detect a node crash at process 702 and then perform a repair job at process 704. As shown in FIG. 7C, the repair job 704 includes replaying, at process 706, the not yet rolled up payload that arrived at the crashed aggregator a2 before the crash (e.g., from 10:00 am to 10:25 am). The data received at the crashed aggregator a2 from 9:50 am to 9:59 am should have been rolled up already. The 25 minutes of metric data lost in the crash of aggregator a2, as recovered by the replay, are sent to aggregator a3 at process 708 to be aggregated. The aggregated value for the 25-minute portion originally received by the crashed aggregator is saved into HBase at process 710.

The repair job could be very expensive depending on the time of the crash. For example, when an aggregator crashes at 10:59 am, the repair job has to replay the entire 1 hour of payloads received. The disclosed technology provides a technique 712 for preparing for the repair job by optimizing the replay process during the repair job, as shown in FIG. 7D. To prepare for the repair job at process 712, the time range (e.g., 1-hour time range) is partitioned at each aggregator into multiple buckets (e.g., 6 buckets of 10 minutes each) that break up the time range into smaller manageable chunks at process 714. For the above example where aggregator a2 crashed, the first bucket can store metric data from 10:00 am to 10:09 am; the second bucket can store metric data from 10:10 am to 10:19 am; the third bucket can store metric data from 10:20 am to 10:29 am; the fourth bucket can store metric data from 10:30 am to 10:39 am; the fifth bucket can store metric data from 10:40 am to 10:49 am; and the sixth bucket can store metric data from 10:50 am to 10:59 am. To provide a buffer beyond the one-hour time range and into the next hour, additional buckets (e.g., another 3 buckets for 30 minutes of the next hour) are added to the list at process 716. The values in each bucket are accumulated at each aggregator and written to the database, such as HBase. Specifically, the metric data values in the 1st bucket are accumulated at process 718. An aggregator suffix is used to identify the accumulated value at process 720 before writing to the database at process 722. When it is determined at process 724 that more buckets are available, the metric data values in the next bucket are accumulated at process 726 and merged with the accumulated values from the previous bucket at process 728. The same aggregator suffix is used to write the merged value to the database, such as HBase, at processes 720 and 722. The merged value overrides the previously stored value in HBase. The values in each of the subsequent buckets are accumulated and merged with the previous buckets until all of the buckets are processed.
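The bucket preparation above can be sketched as follows, assuming 10-minute buckets and a `(count, total)` accumulated value. A dictionary stands in for the database, and the names `bucket_index` and `flush_buckets` are hypothetical; the sketch only shows the accumulate, merge, and overwrite sequence for complete buckets.

```python
from typing import Dict, List, Tuple

BUCKET_MINUTES = 10  # six buckets per hour, as in the example above


def bucket_index(minute_of_hour: int) -> int:
    """Map a minute within the hour (0-59) to its 10-minute bucket (0-5)."""
    return minute_of_hour // BUCKET_MINUTES


def flush_buckets(points: Dict[int, float], column: str,
                  store: Dict[str, Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Accumulate each bucket, merge it with earlier buckets, and overwrite one column.

    Returns the sequence of values written, one per bucket that holds data.
    """
    buckets: Dict[int, List[float]] = {}
    for minute, value in points.items():
        buckets.setdefault(bucket_index(minute), []).append(value)

    written: List[Tuple[int, float]] = []
    count, total = 0, 0.0
    for index in sorted(buckets):
        count += len(buckets[index])    # merge with the previous buckets
        total += sum(buckets[index])
        store[column] = (count, total)  # same suffixed column, so the old value is replaced
        written.append((count, total))
    return written


store: Dict[str, Tuple[int, float]] = {}
# Aggregator a2 received one value of 2.0 per minute from 10:00 to 10:19 (two full buckets).
history = flush_buckets({minute: 2.0 for minute in range(20)}, "10:00_a2", store)
```

After the second bucket is flushed, the single cell “10:00_a2” holds the merged value for the first 20 minutes, which is what bounds the amount of data a later repair job has to replay.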

FIG. 7E shows an exemplary repair process using buckets 730 created during the preparation process. For example, from 10:00 am to 10:09 am, the aggregator accumulates all 10 metric data points in the 1st bucket b1. At 10:10 am, the aggregator writes to HBase the aggregated value of the first 10 data points from the 1st bucket b1, with an aggregator suffix added to the column name. From 10:10 am to 10:19 am, the aggregator can accumulate all 10 metric data points in the 2nd bucket b2. At 10:20 am, the aggregator can merge the metric values from the 1st and 2nd buckets b1 and b2 and write to HBase, overwriting the first value written after 10 minutes. After every 10-minute window, the metric values accumulated in the current bucket are merged with all the previous buckets and then written to HBase, overwriting the previous value. The same aggregator suffix, e.g. 10:00_a2, is used as the column name. The accumulating, merging, and writing of the metric values from each bucket continues until all buckets for the entire time range (e.g., 1 hour in this crash example) have been processed.

The added buckets b7, b8, and b9 are used to store the next hour's accumulated metric values. After 11:30 am (the end of the added buckets in this example), new metric values can be written to bucket 1, and so on to repeat the process. After processing the 9 buckets, 1 hour 30 minutes of aggregated data are stored in the aggregator memory.

FIG. 7F is a process flow diagram of a process 740 for repairing data after an aggregator crashes and once the preparation process 712 has been performed to partition the time range. Due to the preparation process, when aggregator a2 crashes (702) at 10:25 am, the only bucket that could be corrupted would be the 3rd bucket b3. The crashed aggregator a2 would have already written the rolled up data till 10:20 am in column “10:00_a2” using the aggregator suffix. After aggregator a2 crashes, from 10:25 am forward, all metric data values are sent to aggregator a3 till 10:59 am, the next rollup time. At aggregator a3, buckets 4, 5, and 6 would have correct aggregated metric values. The third bucket b3 will have only 5 data points and that will be incorrect.

Thanks to the preparation process 712 of partitioning the time range, the repair job 704 can be performed for only a 10-minute time range, 10:20 am to 10:29 am. Thus, the data is replayed for only those 10 minutes in the event of an aggregator node crash. The accumulated buckets in aggregator a3 can be written to HBase with a column name having the suffix a3, e.g. “10:00_a3”. In some implementations, the number of data points in each bucket can vary depending on the time range to be partitioned and the number of buckets created. The read scans can merge both cells (from both aggregators) in real time and send the results to the client program. The read time resolution method is the same as described above for FIGS. 5C and 6C.
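As a small worked example of the reduced repair range, the following Python sketch computes which 10-minute bucket has to be replayed for a given crash time. The function name `replay_window` and the calendar date are arbitrary assumptions; the end of the returned range is exclusive, so 10:20 am to 10:30 am corresponds to the range 10:20 am to 10:29 am in the text.

```python
from datetime import datetime, timedelta
from typing import Tuple

BUCKET_MINUTES = 10  # bucket width used in the example above


def replay_window(crash_time: datetime) -> Tuple[datetime, datetime]:
    """Return the [start, end) range of the bucket that was open at the crash."""
    minute = crash_time.minute - crash_time.minute % BUCKET_MINUTES
    start = crash_time.replace(minute=minute, second=0, microsecond=0)
    return start, start + timedelta(minutes=BUCKET_MINUTES)


# Aggregator a2 crashes at 10:25 am: only the third bucket needs to be replayed.
start, end = replay_window(datetime(2024, 1, 1, 10, 25))
```

With this partitioning, a crash at 10:59 am would require replaying only the bucket starting at 10:50 am rather than the entire hour.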

Exemplary Advantages

The disclosed technology for performing distributed time rollups of metrics data in real-time enables the movement of the rollup processing away from the database, such as HBase, to the new micro services including the aggregators. Performing the rollup processing in the aggregators can free up computing requirements in the HBase system, which can lead to removal of computing power from the HBase system. Because database nodes with computing power are more expensive, the metric processing costs can be reduced significantly.

Application Intelligence Platform Architecture

The disclosed technology for distributed metric data time rollup in real-time can be implemented in a metric processing system in communication with the agents and controllers of an application intelligence platform. As shown in FIG. 4, the metric processing system can include collectors and aggregators behind a load balancer. FIG. 8 is a block diagram of an exemplary application intelligence platform 800 that can implement the distributed metric data time rollup in real-time as disclosed in this patent document. The application intelligence platform is a system that monitors and collects metrics of performance data for an application environment being monitored. In its simplest structure, the application intelligence platform includes one or more agents 810, 812, 814, 816 and one or more controllers 820. While FIG. 8 shows four agents communicatively linked to a single controller, the total number of agents and controllers can vary based on a number of factors including the number of applications monitored, how distributed the application environment is, the level of monitoring desired, the level of user experience desired, etc.

Controllers, Agents, and Metric Processing System

The controller 820 is the central processing and administration server for the application intelligence platform. The controller 820 serves a browser-based user interface (UI) 830 that is the primary interface for monitoring, analyzing, and troubleshooting the monitored environment. The controller 820 can control and manage monitoring of business transactions distributed over application servers. Specifically, the controller 820 can receive runtime data from agents 810, 812, 814, 816 and coordinators, associate portions of business transaction data, communicate with agents to configure collection of runtime data, and provide performance data and reporting through the interface 830. The interface 830 may be viewed as a web-based interface viewable by a client device 840. In some implementations, a client device 840 can directly communicate with controller 820 to view an interface for monitoring data. The controller can communicate with the multiple agents to return the appropriate pre-fetch application performance data responsive to a request for the pre-fetch application performance data. In some implementations, an application may touch more than one machine and thus application performance data from multiple agents can be combined together by the controller.

In the Software as a Service (SaaS) implementation, a controller instance 820 is hosted remotely by a provider of the application intelligence platform 800. In the on-premise (On-Prem) implementation, a controller instance 820 is installed locally and self-administered.

The controllers 820 receive data from different agents 810, 812, 814, 816 deployed to monitor applications, databases and database servers, servers, and end user clients for the monitored environment. Any of the agents 810, 812, 814, 816 can be implemented as different types of agents with specific monitoring duties. For example, application agents are installed on each server that hosts applications to be monitored. Instrumenting an agent adds an application agent into the runtime process of the application.

Database agents are software (e.g., a Java program) installed on a machine that has network access to the monitored databases and the controller. Database agents query the monitored databases to collect metrics and pass the metrics for display in the database monitoring section of the metric browser and in the databases pages of the controller UI. Multiple database agents can report to the same controller. Additional database agents can be implemented as backup database agents to take over for the primary database agents during a failure or planned machine downtime. The additional database agents can run on the same machine as the primary agents or on different machines. A database agent can be deployed in each distinct network of the monitored environment. Multiple database agents can run under different user accounts on the same machine.

Standalone machine agents are standalone programs (e.g., a standalone Java program) that collect hardware-related performance statistics from the servers in the monitored environment. The standalone machine agents can be deployed on machines that host application servers, database servers, messaging servers, Web servers, etc. A standalone machine agent has an extensible architecture.

End user monitoring (EUM) is performed using browser agents and mobile agents to provide performance information from the point of view of the client, such as a web browser or a mobile native application. Browser agents and mobile agents differ from the monitoring performed through application agents, database agents, and standalone machine agents, which are on the server. Through EUM, web use (e.g., by real users or synthetic agents), mobile use, or any combination can be monitored depending on the monitoring needs.

Browser agents are small files using web-based technologies, such as JavaScript agents, that are injected into each instrumented web page, as close to the top as possible, as the web page is served, and that collect data. Once the web page has completed loading, the collected data is bundled into a beacon and sent to the EUM cloud for processing, ready for retrieval by the controller. Browser real user monitoring (Browser RUM) provides insights into the performance of a web application from the point of view of a real or synthetic end user. For example, Browser RUM can determine how specific Ajax or iframe calls are slowing down page load time and how server performance impacts end user experience in aggregate or in individual cases.

A mobile agent is a small piece of highly performant code that gets added to the source of the mobile application. Mobile RUM provides information on the native iOS or Android mobile application as the end users actually use the mobile application. Mobile RUM provides visibility into the functioning of the mobile application itself and the mobile application's interaction with the network used and any server-side applications the mobile application communicates with.

The application intelligence platform 800 can include a metric processing system 850 that includes collectors 852 and aggregators 854. The metric processing system 850 can be implemented substantially similar to the metric processing system 400 shown in FIG. 4. As disclosed with respect to FIG. 4, the metric processing systems 400 and 850 can perform the distributed metric data time rollup in real-time as disclosed in this patent document. The metric processing system 850 can communicate with controller 820 to provide the time rolled up data to the controller when requested. In some implementations, the metric payload data from the agents are received through the controller 820.

Application Intelligence Monitoring

The disclosed technology can provide application intelligence data by monitoring an application environment that includes various services such as web applications served from an application server (e.g., Java virtual machine (JVM), Internet Information Services (IIS), Hypertext Preprocessor (PHP) Web server, etc.), databases or other data stores, and remote services such as message queues and caches. The services in the application environment can interact in various ways to provide a set of cohesive user interactions with the application, such as a set of user services applicable to end user customers.

Application Intelligence Modeling

Entities in the application environment (such as the JBoss service, MQSeries modules, and databases) and the services provided by the entities (such as a login transaction, service or product search, or purchase transaction) are mapped to an application intelligence model. In the application intelligence model, a business transaction represents a particular service provided by the monitored environment. For example, in an e-commerce application, particular real-world services can include user logging in, searching for items, or adding items to the cart. In a content portal, particular real-world services can include user requests for content such as sports, business, or entertainment news. In a stock trading application, particular real-world services can include operations such as receiving a stock quote, buying, or selling stocks.

Business Transactions

A business transaction representation of the particular service provided by the monitored environment provides a view on performance data in the context of the various tiers that participate in processing a particular request. A business transaction represents the end-to-end processing path used to fulfill a service request in the monitored environment. Thus, a business transaction is a type of user-initiated action in the monitored environment defined by an entry point and a processing path across application servers, databases, and potentially many other infrastructure components. Each instance of a business transaction is an execution of that transaction in response to a particular user request. A business transaction can be created by detecting incoming requests at an entry point and tracking the activity associated with the request at the originating tier and across distributed components in the application environment. A flow map can be generated for a business transaction that shows the touch points for the business transaction in the application environment.

Performance monitoring can be oriented by business transaction to focus on the performance of the services in the application environment from the perspective of end users. Performance monitoring based on business transaction can provide information on whether a service is available (e.g., users can log in, check out, or view their data), response times for users, and the cause of problems when the problems occur.

Business Applications

A business application is the top-level container in the application intelligence model. A business application contains a set of related services and business transactions. In some implementations, a single business application may be needed to model the environment. In some implementations, the application intelligence model of the application environment can be divided into several business applications. Business applications can be organized differently based on the specifics of the application environment. One consideration is to organize the business applications in a way that reflects work teams in a particular organization, since role-based access controls in the Controller UI are oriented by business application.

Nodes

A node in the application intelligence model corresponds to a monitored server or JVM in the application environment. A node is the smallest unit of the modeled environment. In general, a node corresponds to an individual application server, JVM, or CLR on which a monitoring Agent is installed. Each node identifies itself in the application intelligence model. The Agent installed at the node is configured to specify the name of the node, tier, and business application under which the Agent reports data to the Controller.

Tiers

Business applications contain tiers, the unit in the application intelligence model that includes one or more nodes. Each node represents an instrumented service (such as a web application). While a node can be a distinct application in the application environment, in the application intelligence model, a node is a member of a tier, which, along with possibly many other tiers, makes up the overall logical business application.

Tiers can be organized in the application intelligence model depending on a mental model of the monitored application environment. For example, identical nodes can be grouped into a single tier (such as a cluster of redundant servers). In some implementations, any set of nodes, identical or not, can be grouped into a single tier for the purpose of treating certain performance metrics as a unit.

The traffic in a business application flows between tiers and can be visualized in a flow map using lines between tiers. In addition, the lines indicating the traffic flows between tiers can be annotated with performance metrics. In the application intelligence model, there may not be any interaction among nodes within a single tier. Also, in some implementations, an application agent node cannot belong to more than one tier. Similarly, a machine agent cannot belong to more than one tier. However, more than one machine agent can be installed on a machine.

Backend System

A backend is a component that participates in the processing of a business transaction instance. A backend is not instrumented by an agent. A backend may be a web server, database, message queue, or other type of service. The agent recognizes calls to these backend services from instrumented code (called exit calls). When a service is not instrumented and cannot continue the transaction context of the call, the agent determines that the service is a backend component. The agent picks up the transaction context at the response at the backend and continues to follow the context of the transaction from there.

Performance information is available for the backend call. For detailed transaction analysis for the leg of a transaction processed by the backend, the database, web service, or other application needs to be instrumented.

Baselines and Thresholds

The application intelligence platform uses both self-learned baselines and configurable thresholds to help identify application issues. A complex distributed application has a large number of performance metrics and each metric is important in one or more contexts. In such environments, it is difficult to determine the values or ranges that are normal for a particular metric; set meaningful thresholds on which to base and receive relevant alerts; and determine what is a “normal” metric when the application or infrastructure undergoes change. For these reasons, the disclosed application intelligence platform can perform anomaly detection based on dynamic baselines or thresholds.

The disclosed application intelligence platform automatically calculates dynamic baselines for the monitored metrics, defining what is “normal” for each metric based on actual usage. The application intelligence platform uses these baselines to identify subsequent metrics whose values fall out of this normal range. Static thresholds that are tedious to set up and, in rapidly changing application environments, error-prone, are no longer needed.

The disclosed application intelligence platform can use configurable thresholds to maintain service level agreements (SLAs) and ensure optimum performance levels for the monitored system by detecting slow, very slow, and stalled transactions. Configurable thresholds provide a flexible way to associate the right business context with a slow request to isolate the root cause.

Health Rules, Policies, and Actions

In addition, health rules can be set up with conditions that use the dynamically generated baselines to trigger alerts or initiate other types of remedial actions when performance problems are occurring or may be about to occur.

For example, dynamic baselines can be used to automatically establish what is considered normal behavior for a particular application. Policies and health rules can be used against baselines or other health indicators for a particular application to detect and troubleshoot problems before users are affected. Health rules can be used to define metric conditions to monitor, such as when the “average response time is four times slower than the baseline”. The health rules can be created and modified based on the monitored application environment.

Examples of health rules for testing business transaction performance can include business transaction response time and business transaction error rate. For example, a health rule that tests whether the business transaction response time is much higher than normal can define a critical condition as the combination of an average response time greater than the default baseline by 3 standard deviations and a load greater than 50 calls per minute. This health rule can define a warning condition as the combination of an average response time greater than the default baseline by 2 standard deviations and a load greater than 100 calls per minute. The health rule that tests whether the business transaction error rate is much higher than normal can define a critical condition as the combination of an error rate greater than the default baseline by 3 standard deviations and an error rate greater than 10 errors per minute and a load greater than 50 calls per minute. This health rule can define a warning condition as the combination of an error rate greater than the default baseline by 2 standard deviations and an error rate greater than 5 errors per minute and a load greater than 50 calls per minute.
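The response time health rule described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the platform's implementation: the names `Baseline`, `compute_baseline`, and `response_time_health` are hypothetical, and the baseline is assumed to be the mean and population standard deviation of recent samples.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Baseline:
    """Hypothetical baseline: mean and standard deviation of past samples."""
    mean: float
    stddev: float


def compute_baseline(samples):
    # Assumption: the baseline is derived from a simple window of past values.
    return Baseline(mean=mean(samples), stddev=pstdev(samples))


def response_time_health(avg_response_ms, calls_per_minute, baseline):
    """Evaluate the response time rule using the thresholds quoted in the text."""
    # Critical: more than 3 standard deviations above baseline, load > 50 calls/min.
    if (avg_response_ms > baseline.mean + 3 * baseline.stddev
            and calls_per_minute > 50):
        return "critical"
    # Warning: more than 2 standard deviations above baseline, load > 100 calls/min.
    if (avg_response_ms > baseline.mean + 2 * baseline.stddev
            and calls_per_minute > 100):
        return "warning"
    return "normal"
```

With a history of `[100, 110, 90, 105, 95]` milliseconds, the baseline mean is 100 and the standard deviation is about 7.07, so a response time of 130 ms at 60 calls per minute would satisfy the critical condition in this sketch.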

Policies can be configured to trigger actions when a health rule is violated or when any event occurs. Triggered actions can include notifications, diagnostic actions, auto-scaling capacity, and running remediation scripts.

Metrics

Most of the metrics relate to the overall performance of the application or business transaction (e.g., load, average response time, error rate, etc.) or of the application server infrastructure (e.g., percentage CPU busy, percentage of memory used, etc.). The Metric Browser in the controller UI can be used to view all of the metrics that the agents report to the controller.

In addition, special metrics called information points can be created to report on how a given business (as opposed to a given application) is performing. For example, the performance of the total revenue for a certain product or set of products can be monitored. Also, information points can be used to report on how a given code is performing, for example how many times a specific method is called and how long it is taking to execute. Moreover, extensions that use the machine agent can be created to report user defined custom metrics. These custom metrics are base-lined and reported in the controller, just like the built-in metrics.

All metrics can be accessed programmatically using a Representational State Transfer (REST) API that returns either the JavaScript Object Notation (JSON) or the eXtensible Markup Language (XML) format. Also, the REST API can be used to query and manipulate the application environment.

Snapshots

Snapshots provide a detailed picture of a given application at a point in time. Snapshots usually include call graphs that enable drilling down to the line of code that may be causing performance problems. The most common snapshots are transaction snapshots.

Exemplary Implementation of Application Intelligence Platform

FIG. 9 is a block diagram of an exemplary system 900 for distributed metric data time rollup in real-time as disclosed in this patent document, including the techniques disclosed with respect to FIGS. 1A, 1B, 1C, 1D, and 1E. The system 900 in FIG. 9 includes client devices 905 and 992, mobile device 915, network 920, network server 925, application servers 930, 940, 950 and 960, agents 912, 919, 934, 944, 954 and 964, asynchronous network machine 970, data stores 980 and 985, controller 990, and data collection server 995.

The system 900 can include a metric processing system 992 that includes controllers and aggregators. The metric processing system 992 can be implemented in a manner substantially similar to the metric processing system 400 shown in FIG. 4. As disclosed with respect to FIG. 4, the metric processing systems 400, 850, and 992 can perform the distributed metric data time rollup in real-time as disclosed in this patent document. The metric processing system 992 can communicate with controller 990 to provide the time rolled up data to the controller when requested. In some implementations, the metric payload data from the agents are received through the controller 990.
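As an illustration of the consistent hash backed assignment of performance metrics to aggregators, and of merging accumulated values at read time, the following Python sketch may be helpful. It is a minimal sketch under stated assumptions rather than the disclosed implementation: the class `HashRing`, the function `merge`, the use of MD5, the virtual node count, and the sum/count/min/max layout of an accumulated value are all hypothetical choices made for this example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent hash ring that assigns metrics to aggregators."""

    def __init__(self, aggregators, vnodes=64):
        self._aggregators = set(aggregators)
        self._vnodes = vnodes  # virtual nodes per aggregator (assumed value)
        self._rebuild()

    def _rebuild(self):
        # Place each aggregator at several points on the ring, sorted by hash.
        points = sorted(
            (_hash(f"{agg}#{i}"), agg)
            for agg in self._aggregators
            for i in range(self._vnodes)
        )
        self._hashes = [h for h, _ in points]
        self._owners = [agg for _, agg in points]

    def add(self, aggregator):
        self._aggregators.add(aggregator)
        self._rebuild()

    def remove(self, aggregator):
        self._aggregators.discard(aggregator)
        self._rebuild()

    def owner(self, metric_name):
        """Return the aggregator at the next ring point after the metric's hash."""
        if not self._hashes:
            raise LookupError("no aggregators available")
        idx = bisect.bisect_right(self._hashes, _hash(metric_name))
        return self._owners[idx % len(self._hashes)]


def merge(a, b):
    """Merge two accumulated values for the same metric and time period."""
    return {
        "sum": a["sum"] + b["sum"],
        "count": a["count"] + b["count"],
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
    }
```

In this sketch, removing an aggregator only reassigns the metrics that it owned; every other metric keeps its aggregator. Values accumulated for a metric before and after a ring change can then be combined with `merge` when the rolled up value is read.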

Client device 905 may include network browser 910 and be implemented as a computing device, such as for example a laptop, desktop, workstation, or some other computing device. Network browser 910 may be a client application for viewing content provided by an application server, such as application server 930 via network server 925 over network 920.

Network browser 910 may include agent 912. Agent 912 may be installed on network browser 910 and/or client 905 as a network browser add-on, by downloading the application to the server, or in some other manner. Agent 912 may be executed to monitor network browser 910, the operating system of client 905, and any other application, API, or other component of client 905. Agent 912 may determine network browser navigation timing metrics, access browser cookies, monitor code, and transmit data to data collection server 960, controller 990, or another device. Agent 912 may perform other operations related to monitoring a request or a network at client 905 as discussed herein.

Mobile device 915 is connected to network 920 and may be implemented as a portable device suitable for sending and receiving content over a network, such as for example a mobile phone, smart phone, tablet computer, or other portable device. Both client device 905 and mobile device 915 may include hardware and/or software configured to access a web service provided by network server 925.

Mobile device 915 may include network browser 917 and an agent 919. Mobile device 915 may also include client applications and other code that may be monitored by agent 919. Agent 919 may reside in and/or communicate with network browser 917, as well as communicate with other applications, an operating system, APIs and other hardware and software on mobile device 915. Agent 919 may have similar functionality as that described herein for agent 912 on client 905, and may report data to data collection server 960 and/or controller 990.

Network 920 may facilitate communication of data between different servers, devices and machines of system 900 (some connections shown with lines to network 920, some not shown). The network may be implemented as a private network, public network, intranet, the Internet, a cellular network, Wi-Fi network, VoIP network, or a combination of one or more of these networks. The network 920 may include one or more machines such as load balance machines and other machines.

Network server 925 is connected to network 920 and may receive and process requests received over network 920. Network server 925 may be implemented as one or more servers implementing a network service, and may be implemented on the same machine as application server 930 or one or more separate machines. When network 920 is the Internet, network server 925 may be implemented as a web server.

Application server 930 communicates with network server 925, application servers 940 and 950, and controller 990. Application server 930 may also communicate with other machines and devices (not illustrated in FIG. 9). Application server 930 may host an application or portions of a distributed application. The host application 932 may be in one of many platforms, such as Java, PHP, .Net, and Node.JS, may be implemented as a Java virtual machine, or may include some other host type. Application server 930 may also include one or more agents 934 (i.e. “modules”), including a language agent, machine agent, and network agent, and other software modules. Application server 930 may be implemented as one server or multiple servers as illustrated in FIG. 9.

Application 932 and other software on application server 930 may be instrumented using byte code insertion, or byte code instrumentation (BCI), to modify the object code of the application or other software. The instrumented object code may include code used to detect calls received by application 932, calls sent by application 932, and communicate with agent 934 during execution of the application. BCI may also be used to monitor one or more sockets of the application and/or application server in order to monitor the socket and capture packets coming over the socket.

In some embodiments, server 930 may include applications and/or code other than a virtual machine. For example, servers 930, 940, 950, and 960 may each include Java code, .Net code, PHP code, Ruby code, C code, C++ or other binary code to implement applications and process requests received from a remote source. References to a virtual machine with respect to an application server are intended to be for exemplary purposes only.

Agents 934 on application server 930 may be installed, downloaded, embedded, or otherwise provided on application server 930. For example, agents 934 may be provided in server 930 by instrumentation of object code, downloading the agents to the server, or in some other manner. Agent 934 may be executed to monitor application server 930, monitor code running in a virtual machine 932 (or other program language, such as a PHP, .Net, or C program), machine resources, network layer data, and communicate with byte instrumented code on application server 930 and one or more applications on application server 930.

Each of agents 934, 944, 954 and 964 may include one or more agents, such as language agents, machine agents, and network agents. A language agent may be a type of agent that is suitable to run on a particular host. Examples of language agents include a JAVA agent, .Net agent, PHP agent, and other agents. The machine agent may collect data from a particular machine on which it is installed. A network agent may capture network information, such as data collected from a socket.

Agent 934 may detect operations such as receiving calls and sending requests by application server 930, resource usage, and incoming packets. Agent 934 may receive data, process the data, for example by aggregating data into metrics, and transmit the data and/or metrics to controller 990. Agent 934 may perform other operations related to monitoring applications and application server 930 as discussed herein. For example, agent 934 may identify other applications, share business transaction data, aggregate detected runtime data, and perform other operations.

An agent may operate to monitor a node, tier of nodes or other entity. A node may be a software program or a hardware component (e.g., memory, processor, and so on). A tier of nodes may include a plurality of nodes which may process a similar business transaction, may be located on the same server, may be associated with each other in some other way, or may not be associated with each other.

A language agent may be an agent suitable to instrument or modify, collect data from, and reside on a host. The host may be a Java, PHP, .Net, Node.JS, or other type of platform. The language agent may collect flow data as well as data associated with the execution of a particular application. The language agent may instrument the lowest level of the application to gather the flow data. The flow data may indicate which tier is communicating with which tier and on which port. In some instances, the flow data collected from the language agent includes a source IP, a source port, a destination IP, and a destination port. The language agent may report the application data and call chain data to a controller. The language agent may report the collected flow data associated with a particular application to a network agent.

A network agent may be a standalone agent that resides on the host and collects network flow group data. The network flow group data may include a source IP, destination port, destination IP, and protocol information for network flow received by an application on which the network agent is installed. The network agent may collect data by intercepting and performing packet capture on packets coming in from one or more sockets. The network agent may receive flow data from a language agent that is associated with applications to be monitored. For flows in the flow group data that match flow data provided by the language agent, the network agent rolls up the flow data to determine metrics such as TCP throughput, TCP loss, latency and bandwidth. The network agent may then report the metrics, flow group data, and call chain data to a controller. The network agent may also make system calls at an application server to determine system information, such as for example a host status check, a network status check, socket status, and other information.
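The matching and rollup step performed by the network agent can be sketched as follows. This is an illustrative Python sketch only: the types `FlowKey` and `Packet`, the function `roll_up`, and the approximation of TCP loss as the fraction of retransmitted packets are assumptions made for the example and are not taken from this patent document.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Hypothetical flow identifier shared by the language and network agents."""
    src_ip: str
    dst_ip: str
    dst_port: int


class Packet(NamedTuple):
    """Hypothetical captured packet record."""
    key: FlowKey
    size_bytes: int
    retransmit: bool


def roll_up(packets, monitored_flows, window_seconds):
    """Aggregate captured packets into per-flow metrics for monitored flows only."""
    totals = defaultdict(lambda: {"bytes": 0, "packets": 0, "retransmits": 0})
    for packet in packets:
        # Only roll up flows that match flow data reported by the language agent.
        if packet.key not in monitored_flows:
            continue
        entry = totals[packet.key]
        entry["bytes"] += packet.size_bytes
        entry["packets"] += 1
        entry["retransmits"] += int(packet.retransmit)
    return {
        key: {
            # Throughput in bits per second over the capture window.
            "throughput_bps": entry["bytes"] * 8 / window_seconds,
            # Assumption: loss approximated by the share of retransmitted packets.
            "loss_ratio": entry["retransmits"] / entry["packets"],
        }
        for key, entry in totals.items()
    }
```

Flows that the language agent did not report are skipped, so only traffic belonging to monitored applications contributes to the reported metrics in this sketch.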

A machine agent may reside on the host and collect information regarding the machine which implements the host. A machine agent may collect and generate metrics from information such as processor usage, memory usage, and other hardware information.

Each of the language agent, network agent, and machine agent may report data to the controller. Controller 990 may be implemented as a remote server that communicates with agents located on one or more servers or machines. The controller may receive metrics, call chain data and other data, correlate the received data as part of a distributed transaction, and report the correlated data in the context of a distributed application implemented by one or more monitored applications and occurring over one or more monitored networks. The controller may provide reports, one or more user interfaces, and other information for a user.

Agent 934 may create a request identifier for a request received by server 930 (for example, a request received from a client 905 or 915 associated with a user or another source). The request identifier may be sent to client 905 or mobile device 915, whichever device sent the request. In embodiments, the request identifier may be created when data is collected and analyzed for a particular business transaction.

Each of application servers 940, 950 and 960 may include an application and agents. Each application may run on the corresponding application server. Each of applications 942, 952 and 962 on application servers 940-960 may operate similarly to application 932 and perform at least a portion of a distributed business transaction. Agents 944, 954 and 964 may monitor applications 942-962, collect and process data at runtime, and communicate with controller 990. The applications 932, 942, 952 and 962 may communicate with each other as part of performing a distributed transaction. In particular, each application may call any application or method of another virtual machine.

Asynchronous network machine 970 may engage in asynchronous communications with one or more application servers, such as application servers 950 and 960. For example, application server 950 may transmit several calls or messages to an asynchronous network machine. Rather than communicate back to application server 950, the asynchronous network machine may process the messages and eventually provide a response, such as a processed message, to application server 960. Because there is no return message from the asynchronous network machine to application server 950, the communications between them are asynchronous.

Data stores 980 and 985 may each be accessed by application servers such as application server 950. Data store 985 may also be accessed by application server 950. Each of data stores 980 and 985 may store data, process data, and return results for queries received from an application server. Each of data stores 980 and 985 may or may not include an agent.

Controller 990 may control and manage monitoring of business transactions distributed over application servers 930-960. In some embodiments, controller 990 may receive application data, including data associated with monitoring client requests at client 905 and mobile device 915, from data collection server 960. In some embodiments, controller 990 may receive application monitoring data and network data from each of agents 912, 919, 934, 944 and 954. Controller 990 may associate portions of business transaction data, communicate with agents to configure collection of data, and provide performance data and reporting through an interface. The interface may be viewed as a web-based interface viewable by client device 992, which may be a mobile device, client device, or any other platform for viewing an interface provided by controller 990. In some embodiments, a client device 992 may directly communicate with controller 990 to view an interface for monitoring data.

Client device 992 may include any computing device, including a mobile device or a client computer such as a desktop, workstation or other computing device. Client computer 992 may communicate with controller 990 to create and view a custom interface. In some embodiments, controller 990 provides an interface for creating and viewing the custom interface as a content page, e.g., a web page, which may be provided to and rendered through a network browser application on client device 992.

Applications 932, 942, 952 and 962 may be any of several types of applications. Examples of applications that may implement applications 932-962 include Java, PHP, .Net, Node.JS, and other applications.

FIG. 10 is a block diagram of a computer system 1000 for implementing the present technology. System 1000 of FIG. 10 may be implemented in the contexts of the likes of clients 905, 992, mobile device 915, network server 925, servers 930, 940, 950, 960, asynchronous network machine 970 and controller 990.

The computing system 1000 of FIG. 10 includes one or more processors 1010 and memory 1020. Main memory 1020 stores, in part, instructions and data for execution by processor 1010. Main memory 1020 can store the executable code when in operation. The system 1000 of FIG. 10 further includes a mass storage device 1030, portable storage medium drive(s) 1040, output devices 1050, user input devices 1060, a graphics display 1070, and peripheral devices 1080.

The components shown in FIG. 10 are depicted as being connected via a single bus 1090. However, the components may be connected through one or more data transport means. For example, processor unit 1010 and main memory 1020 may be connected via a local microprocessor bus, and the mass storage device 1030, peripheral device(s) 1080, portable or remote storage device 1040, and display system 1070 may be connected via one or more input/output (I/O) buses.

Mass storage device 1030, which may be implemented with a magnetic disk drive or an optical disk drive, is a non-volatile storage device for storing data and instructions for use by processor unit 1010. Mass storage device 1030 can store the system software for implementing embodiments of the present invention for purposes of loading that software into main memory 1020.

Portable storage device 1040 operates in conjunction with a portable non-volatile storage medium, such as a compact disk, digital video disk, magnetic disk, flash storage, etc. to input and output data and code to and from the computer system 1000 of FIG. 10. The system software for implementing embodiments of the present invention may be stored on such a portable medium and input to the computer system 1000 via the portable storage device 1040.

Input devices 1060 provide a portion of a user interface. Input devices 1060 may include an alpha-numeric keypad, such as a keyboard, for inputting alpha-numeric and other information, or a pointing device, such as a mouse, a trackball, stylus, or cursor direction keys. Additionally, the system 1000 as shown in FIG. 10 includes output devices 1050. Examples of suitable output devices include speakers, printers, network interfaces, and monitors.

Display system 1070 may include a liquid crystal display (LCD) or other suitable display device. Display system 1070 receives textual and graphical information, and processes the information for output to the display device.

Peripherals 1080 may include any type of computer support device to add additional functionality to the computer system. For example, peripheral device(s) 1080 may include a modem or a router.

The components contained in the computer system 1000 of FIG. 10 can include a personal computer, hand held computing device, telephone, mobile computing device, workstation, server, minicomputer, mainframe computer, or any other computing device. The computer can also include different bus configurations, networked platforms, multi-processor platforms, etc. Various operating systems can be used including Unix, Linux, Windows, Apple OS, and other suitable operating systems, including mobile versions.

When implementing a mobile device such as smart phone or tablet computer, the computer system 1000 of FIG. 10 may include one or more antennas, radios, and other circuitry for communicating over wireless signals, such as for example communication using Wi-Fi, cellular, or other wireless signals.

While this patent document contains many specifics, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this patent document in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Moreover, the separation of various system components in the embodiments described in this patent document should not be understood as requiring such separation in all embodiments.

Only a few implementations and examples are described and other implementations, enhancements and variations can be made based on what is described and illustrated in this patent document.

1.-10. (canceled)
 11. A method for distributed consistent hash backed time rollup of performance metric data, the method including: receiving, at a plurality of collectors, time series metrics data for a plurality of performance metrics from one or more agents instrumented into one or more monitored applications; aggregating, at a plurality of aggregators communicatively connected to the collectors to form a consistent hash ring, the received time series metrics data for the plurality of performance metrics, wherein each aggregator is assigned to aggregate all received time series metrics data for one or more of the plurality of performance metrics; determining, at a coordinator communicatively connected to the plurality of collectors and the plurality of aggregators, whether the hash ring has changed; and communicating, at the coordinator, information on the determined change to the plurality of collectors.
 12.-15. (canceled)
 16. The method of claim 25, including: writing, to a database, the accumulated value for the removed aggregator obtained from the time series metrics data received at the removed aggregator before removing the aggregator; and writing, to the database, the accumulated value for the next aggregator obtained from the time series metrics data received at the next aggregator after removing the aggregator.
 17.-19. (canceled)
 20. The method of claim 25, including: writing, to a database, the accumulated value for the one of the aggregators obtained from the time series metrics data received at the one of the aggregators before adding the aggregator; and writing, to the database, the accumulated value for the newly added aggregator obtained from the time series metrics data received at the newly added aggregator after adding the newly added aggregator.
 21. The method of claim 16, including: merging the two accumulated values to perform a time roll up for a time period.
 22. A non-transitory computer readable medium embodying instructions when executed by a processor to cause operations to be performed including: receiving, at a plurality of collectors, time series metrics data for a plurality of performance metrics from one or more agents instrumented into one or more monitored applications; aggregating, at a plurality of aggregators communicatively connected to the collectors to form a hash ring, the received time series metrics data for the plurality of performance metrics, wherein each aggregator is assigned to aggregate all received time series metrics data for one or more of the plurality of performance metrics; determining, at a coordinator communicatively connected to the plurality of collectors and the plurality of aggregators, whether one of the plurality of the aggregators in the hash ring has crashed; and performing a repair job to fix data corruption caused by the crashed aggregator.
 23. The non-transitory computer readable medium of claim 22, wherein the instructions when executed by the processor can cause operations to perform operations, including: redistributing the received time series metrics data for the one or more of the plurality of performance metrics assigned to the crashed aggregator to next aggregator in the hash ring.
 24. The non-transitory computer readable medium of claim 22, wherein the instructions when executed by the processor can cause operations to perform the repair job, including: obtaining the time series metrics data received at the crashed aggregator before the crash; and merging the obtained time series metric data from the crashed aggregator with the time series metrics data redistributed to the next aggregator in the hash ring.
 25. The non-transitory computer readable medium of claim 22, wherein the instructions when executed by the processor can cause operations to perform operations, including: splitting the time series metrics data received at the crashed aggregator before the crash into smaller time series.
 26. The method of claim 11, including: accumulating the time series metrics data for the one or more of the plurality of performance metrics assigned to one of the aggregators received at the one of the aggregators before adding a new aggregator; and accumulating the time series metrics data for the one or more of the plurality of performance metrics assigned to the one of the aggregators received at the new aggregator after adding the new aggregator to obtain an accumulated value for the new aggregator.
 27. The non-transitory computer readable medium of claim 22, wherein the instructions when executed by the processor can cause operations to perform operations, including: accumulating the time series metrics data for the one or more of the plurality of performance metrics assigned to one of the aggregators received at the one of the aggregators before adding a new aggregator; and accumulating the time series metrics data for the one or more of the plurality of performance metrics assigned to the one of the aggregators received at the new aggregator after adding the new aggregator to obtain an accumulated value for the new aggregator.